The purpose of this script is to fit the Cubist and random forest models to the training data set and generate predictions on the test data set.
For the training dataset, the response period covers crime rates for Jan 1, 2019 - Dec 31, 2019. We label this script with the index date 2018-12-31: the observation date on which the predictors are known, prior to the start of the training-period response.
The base date in this document is 2019-12-31, the last date before the start of the test period. For the test dataset, the response period covers crime rates for Jan 1, 2020 - Dec 31, 2020.
Both models will be fitted on the training dataset to predict the response of the test dataset.
grid_cellsize = 490 # Hard-coded
loadRDSModel = TRUE # Skips Cubist tuning entirely. If TRUE, tuningFull is not used.
loadRandomForestModel = TRUE # Skips random forest tuning entirely. If TRUE, tuningFull is not used.
tuningFull = TRUE # Tune the model over the full calibration grid (TRUE takes a long time, FALSE takes less time). Only applies if loadRDSModel == FALSE
all_data_on_grid_training_file = paste0(data_dir, "EXP_ALL_DATA_ON_GRID_490_2018-12-31.geojson")
all_data_on_grid_test_file = paste0(data_dir, "EXP_ALL_DATA_ON_GRID_490_2019-12-31.geojson")
To generate the training and test data sets, we load the grid files and strip out non-predictor and redundant columns. Since we are predicting crime frequencies, we strip out the crime counts, and since the models do not require the grid geometries, we drop the geometry column.
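As a minimal sketch of that stripping step (the column names below are hypothetical; the actual script reads the geojson files with sf and drops the sf geometry column):

```r
# Sketch of building the predictor matrix from a grid data frame.
# "id_grid", "crime_count", and the predictor names are assumed here,
# not taken from the script.
make_predictor_frame <- function(grid_df) {
  non_predictors <- c("geometry", "id_grid", "crime_count")
  grid_df[, setdiff(names(grid_df), non_predictors), drop = FALSE]
}

train_df <- data.frame(id_grid = 1:3,
                       pop_density = c(10, 20, 30),
                       dist_to_road = c(1.2, 0.5, 2.0),
                       crime_count = c(4, 0, 7))
X_vars_on_grid_train <- make_predictor_frame(train_df)
names(X_vars_on_grid_train)  # "pop_density" "dist_to_road"
```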
We train the Cubist model using caret, searching over 10-100 committees and 0-9 neighbors. The selection criterion is the model with the lowest RMSE.
set.seed(1027)
cubistControlFull <- trainControl(method = "cv", selectionFunction = "best")
tuneGridFull <- expand.grid(committees = c(10, 50, 100),
                            neighbors = c(0, 1, 5, 9))
cubistControlLight <- trainControl(method = "cv", selectionFunction = "best", verboseIter = TRUE)
tuneGridLight <- expand.grid(committees = c(100),
                             neighbors = c(0))
rdsFileName = "cubistTune_train_2018-12-31.rds"
if(loadRDSModel == FALSE ){
if(tuningFull == TRUE)
{
cubistControl = cubistControlFull
cubistGrid = tuneGridFull
} else {
cubistControl = cubistControlLight
cubistGrid = tuneGridLight
}
(cubistTune = caret::train( x = X_vars_on_grid_train,
y = Y_on_grid_train ,
method = "cubist",
tuneGrid = cubistGrid ,
verbose = FALSE,
metric = "RMSE" ,
trControl = cubistControl ) )
saveRDS(cubistTune, file = rdsFileName)
} else {
cubistTune = readRDS(rdsFileName)
}
cubistTune
## Cubist
##
## 19782 samples
## 48 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 17803, 17804, 17804, 17804, 17804, 17804, ...
## Resampling results across tuning parameters:
##
## committees neighbors RMSE Rsquared MAE
## 10 0 0.8139778 0.6795867 0.2316203
## 10 1 0.9730717 0.5662784 0.3144288
## 10 5 0.8288617 0.6555496 0.2856498
## 10 9 0.8124881 0.6689108 0.2801826
## 50 0 0.8113386 0.6836329 0.2309959
## 50 1 0.9710231 0.5692205 0.3137215
## 50 5 0.8266854 0.6581639 0.2849059
## 50 9 0.8096557 0.6723146 0.2793732
## 100 0 0.8109074 0.6842842 0.2307630
## 100 1 0.9709695 0.5691123 0.3135567
## 100 5 0.8260157 0.6588417 0.2847187
## 100 9 0.8092202 0.6726947 0.2792815
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 100 and neighbors = 9.
We skip the Cubist dotplot of the splits: it takes a long time to render and is not especially informative here.
Next, we consider the prediction on the test set.
Let’s use the Prediction Accuracy Index (PAI) and AUC to measure the quality of the hotspot predictions.
We can define hotspots as the top 100 grid cells by predicted crime frequency.
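PAI is the fraction of crime captured by the hotspot cells divided by the fraction of area they cover. A base-R sketch, assuming equal-area grid cells (the `pai()` helper is ours, not the script's):

```r
# PAI for the top-n cells: share of crime captured / share of area covered.
# Assumes every grid cell has the same area, as with the 490 ft grid.
pai <- function(crime_freq, predicted, top_n = 100) {
  hot <- order(predicted, decreasing = TRUE)[seq_len(top_n)]
  prop_crime <- sum(crime_freq[hot]) / sum(crime_freq)
  prop_area  <- top_n / length(crime_freq)
  prop_crime / prop_area
}

# Toy example: the single hottest of 5 cells captures 10 of 16 crimes,
# on 1/5 of the area, so PAI = 0.625 / 0.2 = 3.125.
pai(crime_freq = c(10, 5, 1, 0, 0), predicted = c(9, 6, 2, 1, 0), top_n = 1)
```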
## [1] "Area in miles^2 of Raleigh Buffered at 100 feet: 156.116262020913 miles"
## Loading required namespace: raster
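The printed area comes from buffering the Raleigh boundary and converting the square-metre areas that sf returns into square miles; the conversion factor is the only nontrivial part (the helper name is ours):

```r
# sf::st_area returns square metres for a projected CRS;
# 1 mile = 1609.344 m, so 1 mi^2 = 1609.344^2 m^2.
sq_m_to_sq_mi <- function(a) a / 1609.344^2

sq_m_to_sq_mi(1609.344^2)  # exactly 1 square mile
```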
Now we evaluate the random forest's predictive accuracy.
set.seed(1027)
rfControlFull <- trainControl(method = "cv", selectionFunction = "best", verboseIter = TRUE)
rfTuneGridFull <- expand.grid(.mtry = seq(2, 20, by = 4),
                              .min.node.size = c(5, 10),
                              .splitrule = "variance")
rfControlLight <- trainControl(method = "cv", selectionFunction = "best", verboseIter = TRUE)
rfTuneGridLight <- expand.grid(.mtry = c(5, 10),
                               .min.node.size = c(5),
                               .splitrule = "variance")
rfRdsFileName = "randomForestTune_train_2018-12-31.rds"
if(loadRandomForestModel == FALSE ){
if(tuningFull == TRUE)
{
print("tuningFull is TRUE")
rfControl = rfControlFull
rfGrid = rfTuneGridFull
} else {
print("tuningFull is FALSE")
rfControl = rfControlLight
rfGrid = rfTuneGridLight
}
(randomForestTune = caret::train( x = X_vars_on_grid_train,
y = Y_on_grid_train ,
method = "ranger",
tuneGrid = rfGrid ,
verbose = FALSE,
metric = "RMSE" ,
importance = "impurity" ,
num.trees = 1000,
trControl = rfControl ) )
saveRDS(randomForestTune, file = rfRdsFileName)
} else {
randomForestTune = readRDS(rfRdsFileName)
}
The tuning results are shown below. There is not much variation in the MAE or R-squared once mtry is greater than 2.
## Random Forest
##
## 19782 samples
## 48 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 17803, 17804, 17804, 17804, 17804, 17804, ...
## Resampling results across tuning parameters:
##
## mtry min.node.size RMSE Rsquared MAE
## 2 5 0.9485338 0.6589756 0.3607999
## 2 10 0.9511516 0.6566675 0.3614408
## 6 5 0.8215157 0.6817859 0.2959122
## 6 10 0.8253773 0.6790895 0.2969019
## 10 5 0.8095380 0.6820717 0.2914426
## 10 10 0.8098394 0.6827083 0.2916268
## 14 5 0.8075560 0.6802940 0.2909837
## 14 10 0.8046769 0.6832490 0.2908628
## 18 5 0.8033923 0.6821224 0.2902635
## 18 10 0.8037805 0.6822839 0.2907471
##
## Tuning parameter 'splitrule' was held constant at a value of variance
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were mtry = 18, splitrule = variance
## and min.node.size = 5.
Let’s use PAI and weighted ROC to measure the quality of the hotspot predictions.
| rank | id_grid_rf | cum_crime_freq_rf | prop_cum_area_rf | prop_cum_crime_freq_rf | PAI_rf | RRI_rf | id_grid_cu | cum_crime_freq_cu | prop_cum_area_cu | prop_cum_crime_freq_cu | PAI_cu | RRI_cu |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 10922 | 38 | 0.000 | 0.007 | 121.722 | 1.102 | 10922 | 38 | 0.000 | 0.007 | 121.722 | 1.262 |
| 2 | 12189 | 44 | 0.000 | 0.008 | 70.470 | 1.834 | 12189 | 44 | 0.000 | 0.008 | 70.470 | 2.064 |
| 3 | 12401 | 64 | 0.000 | 0.011 | 68.335 | 1.851 | 12401 | 64 | 0.000 | 0.011 | 68.335 | 2.059 |
| 4 | 10923 | 125 | 0.000 | 0.022 | 100.100 | 1.199 | 10923 | 125 | 0.000 | 0.022 | 100.100 | 1.321 |
| 5 | 12649 | 158 | 0.000 | 0.028 | 101.221 | 1.105 | 6464 | 150 | 0.000 | 0.027 | 96.096 | 1.265 |
| 6 | 11101 | 161 | 0.000 | 0.028 | 85.953 | 1.225 | 12649 | 183 | 0.000 | 0.032 | 97.698 | 1.169 |
| 7 | 6464 | 186 | 0.000 | 0.033 | 85.114 | 1.176 | 8551 | 200 | 0.000 | 0.035 | 91.520 | 1.174 |
| 8 | 14140 | 205 | 0.000 | 0.036 | 82.082 | 1.170 | 12613 | 211 | 0.000 | 0.037 | 84.484 | 1.210 |
| 9 | 4349 | 225 | 0.000 | 0.040 | 80.080 | 1.154 | 4349 | 231 | 0.000 | 0.041 | 82.215 | 1.194 |
| 10 | 12613 | 236 | 0.001 | 0.042 | 75.596 | 1.181 | 14140 | 250 | 0.001 | 0.044 | 80.080 | 1.185 |
## [1] "AUC for Cubist Model is: 0.876214710199852"
## [1] "AUC for Random Forest is: 0.888340064380848"
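One way to realize the crime-weighted AUC is to rank cells by predicted frequency and accumulate observed crime (the weighted hit rate) against the share of cells covered (the area rate). This base-R sketch is our reading of the metric, not the script's exact code:

```r
# Crime-weighted AUC: the hit rate is weighted by observed crime counts,
# while the coverage rate is the share of (equal-area) cells inspected.
weighted_auc <- function(crime_freq, predicted) {
  ord <- order(predicted, decreasing = TRUE)
  tpr <- cumsum(crime_freq[ord]) / sum(crime_freq)
  fpr <- seq_along(ord) / length(ord)
  # Trapezoidal rule over the curve, starting from (0, 0)
  sum(diff(c(0, fpr)) * (tpr + c(0, head(tpr, -1))) / 2)
}

# Toy example: ranking 3 cells perfectly by their crime counts.
weighted_auc(crime_freq = c(3, 1, 0), predicted = c(0.9, 0.5, 0.1))  # 0.75
```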
| Rank | PAI (RF) | PAI (Cubist) | Cum. crime freq. (RF) | Cum. crime freq. (Cubist) | % cum. crime (RF) | % cum. crime (Cubist) | % cum. area |
|---|---|---|---|---|---|---|---|
| 1 | 121.72 | 121.72 | 38 | 38 | 0.67 | 0.67 | 0.01 |
| 5 | 101.22 | 96.10 | 158 | 150 | 2.79 | 2.65 | 0.03 |
| 25 | 53.94 | 52.66 | 421 | 411 | 7.44 | 7.26 | 0.14 |
| 100 | 30.76 | 28.99 | 952 | 898 | 16.82 | 15.87 | 0.55 |
| 200 | 23.50 | 23.44 | 1461 | 1457 | 25.82 | 25.75 | 1.10 |